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Ph . Abstract 

.,—t' In this chapter we review some recent results that shed hght on a fundamental question in 

1-^ . molecular systematics: how much phylogenetic 'signal' can we expect from characters that have 

^^' evolved under some Markov process? There are many sides to this question and we begin by 

describing some explicit bounds on the probability of correctly reconstructing an ancestral state 

from the states observed at the tips. We show how this bound sets upper limits on the probability 

^ . of tree reconstruction from aligned sequences, and we provide some new extensions that allow 

OO ' site-to-site rate variation or a covarion mechanism. We then explore the relationship between 

■"si" . the number of sites required for accurate tree reconstruction and other model parameters - such 

^^ ' as the number of species, and substitution probabilities, and we describe a phase transition that 

occurs when substitution probabilities exceed a critical value. In the remainder of this chapter 
^^ ' we turn to models of character evolution where the state space is assumed to be either infinite 

(^ . or very large. These models have some relevance to certain types of genomic data (such as 

gene order) and here we again investigate how many characters are required for accurate tree 

reconstruction. 



o 

> ! 1 Introduction 

X 

^ ' As biologists delve deeper into the evolutionary history of life they often find that sequence data 

provides conflicting or unclear phylogenetic information. For DNA sequences that have a high 
site substitution rate the problem of site saturation is well known, whereby certain sequences are 
essentially random with respect to each other due to the number of substitutions that have occurred 
during their evolution from a common ancestral sequence. For other sorts of data - such a gene 
order data, where genomes have undergone much reshuffling - a similar eventual randomization and 
loss of information also occurs. 
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The phenomenon of randomization, and the rate at which it occurs, have been weh studied in 
the probabihty hterature - see for example Diaconis [12]. In this setting it is often useful to regard 
the stochastic process as a random walk on a group. For example, card shuffling, or (unsigned) gene 
order rearrangement may be viewed as a random walk on the symmetric group on n elements (i.e. 
the group consisting of all n! permutations on n elements, under composition) while site substitution 
in DNA sequences of length k may be regarded as a random walk on the group (Z2 x 1^2)^ (since 
the three types of DNA substitutions - transitions and the two types of transversions - together 
with an identity forms a group under the operation of composition that is isomorphic to group 
with elements (0, 0), (0, 1), (1, 0), (1, 1) under componentwise addition ; a link that was first noted 
and exploited by Evans and Speed [16]). An alternative setting to a 'random walk on a group' 
is to consider a random walk on a finite regular connected graph, and most of the examples we 
have just mentioned can also be viewed from this perspective. Either setting - a random walk on 
a group, or a random walk on a graph - is just a special type of ergodic Markov chain, for which 
the usual questions arise, such as what is the limiting distribution, and how fast does the chain 
approach this limit? Often there is an abrupt transition from non-random to random in a sense 
that can be formalized and proved. For example, with binary sequences of length k (where k is 
large) under a model of independent site substitution, this transition occurs when each site has 
undergone approximately | logg (n) substitutions - beyond this point the derived sequence quickly 
becomes essentially random with respect to the first (for a precise rendition of this statement see 
[12], Theorem 3, p. 28). A similar type of transition for gene order rearrangement under random 
inversions was recently derived by [13]. 

While these questions have been well understood for Markov chains, they have been less thor- 
oughly investigated for the more general setting of Markov processes on trees. 

The situation here is interesting for the following reason - as the tree gets larger each leaf tends 
to become further from the root (and so conveys less information about the ancestral root state) 
yet the number of leaves also gets larger. It is, a priori, not clear whether the gain in information 
provided by more leaves compensates for the losses experienced by each leaf. This question is 
also familiar in biology - does the sampling of more species provide a strategy for coping with 
site saturation? As we will see, these questions are relevant not just for reconstructing ancestral 
character states, but also for inferring phylogenetic trees. 

Evolution processes may be often viewed as Markov processes on trees. These processes are in 
turn a special family of Markov random fields on trees, the study of which is an important branch 
of statistical physics - see [22] for general background and [26, 15, 38, 42, 33] for results regarding 
Markov processes on trees. The theory of Markov random fields (and processes) on trees is used 
to investigate problems such as ancestral reconstruction of states, which is familiar in both biology 
and physics. In contrast, the problem of reconstructing the tree topology, which is well-studied in 
biology, seems not to have been addressed in the statistical physics literature. 

In this chapter we survey some of the recent advances in the information-theoretic treatment 
of Markov processes on trees. We begin by dealing with Markov processes on a fixed (small) state 
space - for example nucleotide sequence data. Here we describe information-theoretic limits that 
place bounds on the extent to which ancestral states and deep divergences can be resolved from 
sequence data. We also consider the question of how much sequence information is required to 
accurately reconstruct a tree, a question where there remains an interesting unresolved issue. We 



then turn to the analysis of characters on state spaces that are large or infinite, and which exhibit 
a somewhat different (and more tractable) behavior. Along the way we will indicate how such 
character data may be relevant to the analysis of genomic data such as gene order. 



2 Preliminaries 

In this section we describe some background and notation concerning phylogenetic trees and Markov 
processes on trees - readers familiar with these topics may wish to skim over this material. 

2.1 Phylogenetic trees 

Throughout this chapter X is a finite set and we will let n = |X|. A phylogenetic X-tree (or more, 
briefly, a phylogenetic tree) is a tree T = (V, E) having leaf set X, and for which the interior vertices 
are unlabelled and of degree at least 3. If in addition each interior vertex has degree exactly 3 we 
say that T is trivalent. In evolutionary biology, the set X typically represents the extant species 
(or sequences) while the remaining vertices of the tree represent speciation events (or unknown 
ancestral sequences). Trivalent trees (also sometimes called 'fully-resolved') are regarded as the 
most informative as they contain no 'polytomies' (vertices of degree > 3 that generally represent 
uncertainty as to the actual order of speciation). 

Two phylogenetic X-trees T and T' are regarded as equivalent if the identity map on X, 
regarded as a bijection from the set of leaves of T to the leaves of T' extends to a graph isomorphism 
between the two trees. Thus, for example, there are precisely three trivalent (and one non-trivalent) 
phylogenetic X~trees for any set X of size 4. Less formally, two phylogenetic X-trees are equivalent 
if they describe the same graphical relationships between the species in X, even though the trees 
might be drawn differently in the plane. 

We are also interested in rooted phylogenetic X-trees. Briefly, a rooted phylogenetic tree is 
obtained from a phylogenetic tree by either distinguishing some interior vertex as a root, or by 
subdividing an interior edge and calling the new degree- two vertex a root. We denote the root of a 
rooted phylogenetic tree T by p, and direct all edges away from the root. For a rooted phylogenetic 
tree T we will use throughout this chapter the word topology to denote the associated unrooted 
phylogenetic tree (obtained from T by suppressing the root, and if it is of degree 2 identifying its 
two incident edges). A rooted phylogenetic tree is said to be binary if each non-leaf vertex has 
precisely two outgoing arcs. Thus a phylogenetic tree is binary precisely if its topology is trivalent. 
For more background on the mathematics of phylogenetic trees the reader is referred to [51]. 

2.2 Markov processes on trees 

Let C be the set of character states (such as C = {0, 1}, C = {A, C, G, T} or C = {20 amino acids}). 
In keeping with biological convention we will often refer to a site aligned across a set of species X 
as a character on X; mathematically it is simply a function from X to C. To model the evolution 
of characters on a rooted phylogenetic tree T by a Markov process we associate to each directed 
edge e of T a matrix M(e) of transition probabilities, and to the root vertex of T we associate a 



distribution vr of states (see [14] or [53] for a more formal description of the model). 

Many of the standard models in biology satisfy M(e) = exp(t{e)Q), where Q = iqi,j)iec,jec is 
the transition rate matrix and t{e) represents the 'length' of the edge e over which the Markov 
process operates. Furthermore, vr is generally taken to be the equilibrium distribution that satisfies 
ttQ = 0, so as to induce a stationary Markov process. 

The simplest 2-state model is the symmetric Cavender-Farris-Neyman (CFN) model 
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For this model the probability p{e) of a substitution on any edge e of the tree is given by 

^(e) = l(l-exp(-2t(e))). (1) 

With 4 states a slightly more general class of models is the Tajima and Nei's 'equal input' model 
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In case a = b = c = d{= r, say) this is known as the Jukes-Cantor model: 
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Both of these models lead to reversible Markov processes. See [19] for various other families of 
substitution matrices Q appearing in biology. 

A further embellishment of most contemporary models of nucleotide substitution is the inclusion 
of site specific rates. That is, one has a distribution D on some real-valued parameter (the 'rate' of 
evolution of a site) and each site i in the sequence evolves at a rate Aj that is chosen independently 
from this distribution. We refer to the distribution that assigns rate 1 to each site with probability 
1 as the degenerate distribution. 

The substitution process is therefore defined by a transition rate matrix Q, a distribution T> of 
site specific rates, a rooted phylogenetic tree T = {V, E, p), a collection of edge lengths t : E ^ TZ+ 
and a probability distribution vr on the states at the root vertex of T. 

A configuration a : V ^ C is a labeling of the vertices of T by C. We will write a^ for the value of 
a at the vertex v £ V. The distribution of dp is given by vr. If u is w's parent, then the conditional 
distribution of a^ given au at site i is given by the matrix M{e) = exp(Ajt(e)(5), where e = {u,v). 
We will denote the collection of leaves of the tree T by dT and the value of a configuration a at 
the leaves by ag (which is a character on X - that is, a function from X into the set C). 



3 Information-theoretic bounds: ancestral states and deep diver- 
gences 

In this section we describe explicit and easily computable upper bounds on the information that 
extant sequences provide concerning (i) ancestral sequences and (ii) the branching pattern deep 
inside a tree. These bounds are in a sense the simplest bounds that can be put on the reconstruction 
of ancestral states. 

For a leaf v, let path(f ) be the set of edges on the path connecting v to the root p, and let 

t{v)= Yl *(«)• 

egpath(D) 

The molecular clock assumption is that t{v) takes the same value for each v; we do not make this 
assumption anywhere in this chapter, even though we will refer to sums of t(e) values as (elapsed) 
'time'. 

Let TT be the prior distribution of the root character, and let 

A = supF[f{a9)=ap], (3) 

/ 

be the optimal probability of reconstructing the value of ap given ag, where the sup is taken over 
all functions. Assuming that the parameters of the model (i.e. T, the t(e) values and the root state 
distribution n) are known, it follows from a classic result (see for example Theorem 17.2 of [25]) 
that an optimal choice of / is the maximum posterior probability (MAP) estimator - that is, given 
ag one select the root state (s) j to maximize 

Pkakp = j] ■ Tr[ap = j] 

- a task that can be carried out by an efficient (polynomial-time in n) dynamic programming 
approach. 

It also follows from standard information-theoretic theory (Theorem 17.3 of [25]) that the fol- 
lowing lower bound on A applies: 

A > 2--^('^''l'"s) (4) 

where H{ap\aQ) is the conditional entropy of ap given ag is defined by 

H{ap\ad) = - ^ F[ap = i,aQ = a] log2(IP[crp = i\ad = a]). 



Note that in general one cannot expect to recover the root state with probability close to 1. 
Consider for example the Jukes-Cantor model. Even given the state of the children of the root 
there is a non-negligible probability that mutation events occurred along the two edges adjacent 
to the root and conditioned on this event the state of the root is independent from the rest of the 
character. 

As we will see in Section 4, there are various asymptotic results in statistical physics dealing 
with the limiting behavior of H{ap\aQ) and A, but the bounds on A in most of these results are 



not explicit. A notable exception is [15] where a bound on A for the CFN model is given in terms 
of "electrical-resistance" of an electrical network defined on the tree. 

However our main interest here is in providing explicit upper bounds on A, which we now 
describe. As the rate of substitution increases and/or the temporal separation of the root of 
the tree from the leaves increases, we would expect it to become increasingly difficult to recover 
the root state - a phenomenon well known to biologists as 'site saturation'. However it will be 
important (particularly for later results) to quantify this rate of decay of information. The following 
result, which is a slight extension of a result from [40], follows by easy adaptations of coupling 
arguments appearing earlier in statistical physics, see, e.g., [38]. We let Mx>{x) = 'Kx>[e^^] the 
moment generating function of the site specific rate distribution T>. Note that, for the degenerate 
site specific rate distribution we have Mj){x) = e^. 

Theorem 3.1. Consider a Markov model on a tree T , with transition rate matrix Q, edge lengths 
t{e) (for each edge e ofT), and site specific rate distribution D. Let 

Qj = miui^j Qij, q = Y,j Qj- (5) 

Then the optimal reconstruction probability A for the root state satisfies 

A < max7r[o-p = i] + V Mj){-qt{v)). (6) 

vedT 

Note that the first term in (6) is precisely the estimate one would make if one had no knowledge 
of the character states at the leaves of T. Thus Theorem 3.1 says that the improvement over 
this 'trivial' method decays as the expected exponential of —qt{v). Notice also that Theorem 3.1 
assumes that T and the values t{e) are all known exactly - if they are not, then the bound on A 
described applies a fortiori. 

The proof of Theorem 3.1 utilizes the method of coupling where one relates one stochastic 
process to another that is easier to analyze (see e.g. [1] for background on coupling for Markov 
chains) . The style of argument employed here has been applied to the study of percolation (see 
[38], and [47, 3] for background). We outline this argument now. First, we establish the result for 
the special case of constant site specific rate, where each site is assigned rate A with probability 
1. The substitution rate from state i to state j is given by qi^j. Recalling (5), we may define the 
process equivalently as follows. Given the current state i, 

(Jl) jump to state j with rate \qj] 

(J2) jump to state j with rate \{qi^j — qj)- 

The coupling argument relates this process (involving both (Jl) and (J2)) to the simpler process 
involving just (Jl). The crucial point here is that (Jl) is performed independently of the state i. 
For edge e = {u,v), let D{e) be the event that a transition of type (Jl) occurs along the edge e. 
Note that the events -D(e) are independent for different edges and that P[Z)(e)'^] = exp(— gAt(e)), 
where D{eY denotes the (complimentary) event that D{e) does not occur. Moreover, conditioned 
on D(e), cTj, is independent of dp. For a leaf v, let D{v) be the event that transition of type (Jl) 



occurs along an edge e G path(t;). Then 

F[D{vy]= Yl P[^(e)1 = n e-'?^*^'^) =e-«^*(^). 

eGpath(i') e£path(v) 

Finally, let D be the event that D{v) holds for all leaves v G dT. Then 

IP[^1 < Yl ^[D{vT] = Yl e-'^^*(^). (7) 

vedT v£dT 

Note that conditioned on D, cjq and cjp are independent. 

To prove the bound on reconstruction (6), note that if we are not given gq (or any other 
information on dp), then the best reconstruction function / satisfies f = j, where j maximized 
7r[(Tp = i] over all i, and this function has success probability maxj7r[(Tp = i]. Now let / be any 
reconstruction procedure and note that, condition on the event D, Up is independent of gq and 
therefore 

nf{<yd) = <yp] < P[Z?1+P[i?]P[/(a9) = ap|Z)] 

< ¥[0"] + F[D] max7r[ap = i] < ¥[0"] + max7r[(7p = i], 

i i 

and so 

¥[f{a9) = ap] < max^[ap = i] + Y ^"'^*^"^- (8) 

vedT 

Now, consider the case of a general site specific rate distribution D. Clearly, A is the expected 
value (with respect to V) of the conditional probability P[/((Tg) = cTpjA] which we may identify 
with the left-hand side of (8). Consequently, 

A < Ej)[maxTT[ap = i]]+Ej)[ V e"^^*^'')] = max7r[CTp = i] + V Mj){-qXt{v)) 

i ^ i ^ 

as required. ■ 

Example. To illustrate Theorem 3.1 let us consider the simplest model on four states, namely 
the Jukes-Cantor model (2) with a degenerate site specific rate distribution and a molecular clock. 
For this model the equilibrium distribution for states is uniform, so it is natural to take vrfcTp = «] = 4 
for all four choices of i. Now suppose we wish to infer the ancestral state at a vertex in a tree that 
was present t years ago, using the states observed now amongst the n extant descendant species. 
Theorem 3.1 provides the following bound on A: 

A < -- + ne""*, 
~ 4 

and we may identify the product |gt with the expected number of substitutions that occur on any 
path from the root to a leaf. For example, if the substitution rate is constant at (say) 1 substitution 
per million years, and we have a tree with n = 100 leaves whose root is at least 10 million years in 
the past then A < ^ + 0.0002 so a character tells us virtually nothing to help us estimate the state 
that occurred at the root. D 
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and for this matrix it is immediately clear that q = {). 







Notice that some restriction must be placed on the entries of Q for a bound such as that given 
by (6) to be useful. For example, consider a process with three states, with 7r[cjp = i] = 1/3 for 
each value of i, and with 

Q = 

for which q = 0. Then it can be checked that A > 2/3, however we also have maxj ir[ap = i] = 1/3 
so that for this example, A is always bounded away from maxj vrfcp = i]. 

However Theorem 3.1 can be extended to provide some (exponential-decay) bounds similar to 
(6) for certain choices of Q for which q = 0. A case in point is the class of 'covarion-type' models 
(see [21], [46], [60]) in which each state can either be in an 'on' mode or an 'off' mode. A state that 
is 'on' is free to change to other 'on' states, or to turn 'off' (at various rates), while a state that is 
'off' is only free to turn 'on' (at some rate). For two base states and therefore a total of 4 states, 
namely Oom lornOog, log the corresponding rate matrix Q can be written as: 



(9) 



In order to obtain bounds for such models, it is better to apply the coupling argument directly 
to the matrices M{e). Note that, for simplicity, we will also assume all the 'on' sites undergo 
substitution at the same rate (A = 1). Given any real matrix A let mj{A) = miniAij and 
m{A) = '^jiTT-j^A). Write mj{e) for mj{M{e)) and m{e) for m{M{e)). On the edge e, the 
transition process can be described equivalently as follows: Given the current state i, 

(Jl) jump to state j with probability rrij] 

(J2) jump to state j with probability rriij — rrij. 

Note that, as before, (Jl) is performed independently of the state i. Repeating the above argument 
we thus obtain the following bound on the reconstruction probability 

A <max7r[o-p = i]+ ^ JJ (1 - m(e)). (10) 

iie9T eGpath(D) 

For a given tree and substitution matrices we may apply bound (10) directly. However, unlike 
Theorem 3.1, here it is not enough to know for all leaves the total time elapsing from the root. 
Instead, all the edge lengths are needed. 

More can be said if the process described by Q is ergodic (maybe with entries) so that, for 
e > 0, exp(eQ) has all its entries positive. Let us assume that the length of all branches is at least 
e and let a = \J\ — r7T,(exp(e(5)) and note that a < 1. 



Note that if A and B are two stochastic matrices, then 1 — m{AB) < (1 — m{A)){l — m{B)). 
Thus, if t > e, then 

l-m(exp(tQ)) < 1-mf (exp(e[-JQ) j j < (a^jL^J < (q2)57 = a*A. 

Substituting this into (10) we obtain that 

A < max7r[cjp = i] + ^ a*(^)/^ (11) 

v&dT 

Note the similarity between this expression and the one in Theorem 3.1. In particular, in order to 
apply this bound it suffices to know for each leaf the total time elapsed from the root. 

Example. Consider the case where the rates in (9) are given hy ri = r2 = u = v = 1 per one 
million years. Note that since Q is symmetric the stationary distribution is given by the uniform 
distribution. 

Assume furthermore that length of all branches is at least e = 0.25. Using numerical analysis 
software (e.g. Mathematica) we find that 



exp(0.25 X Q) 



/ 0.646645 0.156621 0.175773 0.0209616 \ 

0.156621 0.646645 0.0209616 0.175773 

0.175773 0.0209616 0.801456 0.00180925 

V 0.0209616 0.175773 0.00180925 0.801456 / 



Therefore mi = m2 = 0.0209616 and m^ = m^ = 0.00180925. Thus, m(exp(0.25 x Q)) = 2 x 
0.0209616 + 2 X 0.00180925 = 0.0419232 and a = ^/l-m = 0.978814. 

Suppose we now have a tree with n = 100 leaves and we want to infer the ancestral state of a 
state that was present t million years ago. We thus obtain from (11) that 



A < max7r[o-„ = i] + na*/^ = - + IOOq^*. 



^ -'" 



In particular if t = 100 million years, the probability of reconstructing the ancestral state correctly 
is at most 0.25 + 0.000190498. So again, the character reveals essentially no information about the 
ancestral state. ■ 

3.1 Reconstructing deep divergences 

Theorem 3.1 allows one to place bounds on the extent to which sequences can resolve a divergence 
event deep inside a phylogeny. Consider for example four monophyletic groups of taxa for which 
we have aligned sequences of length k. We may wish to determine which of the three possible 
phylogenetic trees connect these four groups, as illustrated on the left of Fig. 1. 

Clearly, it will only help us in this task if we know the tree topologies of each of the four 
monophyletic groups together with their t{e) values. Each sequence site provides a portion of 
information concerning the 'deep' tree structure (i.e. which of the three possible phylogenetic trees 
connect the four subtrees) and it is possible to explicitly bound the information that the entire 




T-(s) 




Figure 1: Left: An example of a deep divergence involving four subtrees. Centre and Right: The 
tree T{s) and forest T'^{s) 

sequences provide concerning this divergence. In this way one can set explicit lower bounds on 
the number of sites would be needed in order to resolve a deep divergence. One such bound was 
described, for the CFN model, in [52]. Here we describe a more general approach from [40] that 
applies to a wider range of models and settings. 

Let T(s) denote the topology of tree T up to time s from the root, and let T'^is) denote the 
forest consisting of the subtrees from time s to the present (including the associated edge lengths). 
In other words, T{s) describes all divergences up to time s, while T'^is) describes all divergences 
(and their relative separations) from time s, as illustrated on the center and right of Fig. 1. 

Consider the problem of reconstructing T(s) (given T'^(s)) from a sequence of characters that 
are generated by a common Markov process on T, where the prior distribution on T{s) is given by 
a measure n. The prior /i is on T{s) with its edge lengths. However, for a tree topology T, we will 
write /i[T(s) = T] for the prior probability that the topology of T{s) is given by T. 

Note that in the following result (Theorem 3.2) we do not need to assume independence between 
sites that evolve according to this process on T. 

Let us denote a sequence of k identically generated configurations by a^, . . . ,a^. We will also 
denote the values of the configuration a* at the leaves by a^. Similarly, we denote by a^ the 
value of the configuration a* at the root p. Suppose furthermore, that the characters evolve as in 
Theorem 3.1 with substitution matrix Q; and we have a site specific rate distribution "D. Let A (s) 
be the probability of reconstructing, given T'^{s) (with its associated t{e) values) the tree topology 
up to time s, 



A' 



supP[/((c7; 
/ 



i^U 



ns)\T%s)]. 



(12) 



The sup is taken over all functions, and as before, the optimal choice of / is the maximum posterior 
probability (MAP) estimator, which given (cg)^^^ selects a tree T' to maximize 



i{Tis)=T'}nH)U\n^)'T'(')]MT{s)), 



(where l{T(s)=r'} is 1 or depending on whether the topology of T(s) is T' or not). Clearly the 
probability of reconstructing T from (ag)^^^ is less or equal to A-^(s); this latter quantity, which 
is the probability of correctly determining the 'deep' part of the tree, can be bounded as follows. 



10 



Theorem 3.2. Suppose that k sites evolve under a Markov process with a site specific rate distri- 
bution D. Then, for any s > we have: 

A^(s) < max^[r(s) = T] + A; ^ Mx,{-q{t{v) - s)), (13) 

vGdT 
where q is given by (5). 

Outline of the proof. The argument follows similar lines to the proof of Theorem 3.1. For 
character i we say that event Di occurs if, for all v G dT there exists a time t > s at which a 
transition of type (Jl) occurs at least once on the path connecting v to the root of the component 
of T'^{s) that contains v. By the proof of Theorem 3.1 it follows that 

vedT 
where Aj is the rate (chosen from D) that site i evolves at. Consequently, 

P[A^] < Y. Mjy{-q{t{v) - s)), 
v£dT 

and so, by the Bonferroni inequality, 

IP[(niiA)1 < k Y Mv{-qit{v) - s)). 
vedT 

Now, conditional on H^^iDi, the two random variables ((T*)JL]^ and {o'q)^^^ are independent, and 
therefore, T{s) and ((Tg)^^;^ are independent. As in Theorem 3.1 we conclude that 

A^(s) < P[(niiA)1+IP[(niiA)]max^[r(s) =r] < fc ^ Mi,(-g(t(t;)-s)) + max^[r(s) =r], 

v£dT 

as required. D 

Example. To illustrate Theorem 3.2 let us consider again the Jukes-Cantor model (2), with a 
degenerate site specific rate distribution and molecular clock. Suppose we have four monophyletic 
groups of taxa - each with 100 extant species, and with a well-specified tree with edge lengths - 
and we wish to determine which of the three possible trees (choices for T{s)) describes how the 
trees are joined ancestrally (as in Fig. 1). In the absence of any prior information it is natural to 
take /u[T(s) = T'] = | for each of the three possible trivalent trees T'. Suppose it is believed that 
all four lineages existed as far back as (at least) 1 billion years ago, and taking (for example) a 
site substitution rate (3r) of one substitution per fifty million years, we have for any leaf v that 
qt{v) = Art{v) = | • {3r)t{v) = | • 20. Theorem 3.2 then gives A'^(s) < 5 + WOke''^^-'^ which 
implies that at least 700 million sites (!) would be required in order to have any hope of estimating 
the ancestral divergence with probability more than about 0.5. This is perhaps not too surprising 
given that the expected number of substitutions per site along the path from the root to any leaf 
is 20. D 

Remarks 
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(1) As noted above, Theorem 3.2 applies even when the sequence sites are not independent. It 
is possible to extend this theorem further to allow the sites to evolve according to different 
Markov processes. 

(2) In order to get a feeling for the asymptotic behavior of (13), fix s and assume that the tree 
has n = e^* leaves, all at time t. Here we take the asymptotics where t ^ oo (and therefore 
n -^ oo), while s,q and /3 are all constants. Also we assume a degenerate site specific rate 
distribution. Then 

J2 e-«(*(^)-^) = exp(sg) exp{-t{q - /?)). 

v£dT 

Therefore if g > /?, then by (13) if we want to reconstruct the topology up to time s with 
high probability, i.e., A (s) > m.ax.T fJ-[T{s) =T] + 5, where 6 > then we need that 

k > S exp{—sq) exp{t{q — /3)) = 5ex.p{—sq)n'''^~ . 

So the number of characters required grows polynomially with n. 

D 

3.2 Connection with information theory 

Similar bounds to the ones we have described so far can also be stated and derived using classical 
information theory. First we briefly recall the concept of mutual information. For random variables 
X and Y the mutual information between X and Y is defined by 

iix:Y)(= mx)) -.= Y.nx = .,Y = .Hog, (^fc^^) . 

Formally, I{X;Y) is the Kullback-Leibler separation of the joint distribution of X,Y and the 
product distribution of X and Y . Consequently, I{X] 1") > with equality if and only if X and Y 
are independent. Informally I{X;Y) measures the amount of information that Y carries about X 
(or conversely that X carries about Y). When I{X] Y) is small then the best method for inferring 
Y from X does little better than the best method that simply ignores X - a precise formalization 
of this claim is Fano's inequality (see [10] for more details). 

The quantity / has some generic properties that make it useful for analyzing the information 
loss of Markov processes. For example, suppose that X, Y and Z be random variables such that 
X and Z are independent given Y. Then I(X;Z) < iaim{I(X;Y),I(Y; Z)} (the 'data processing 
lemma') and I{{X,Z);Y) < I{X;Y) + I{Z;Y) (the 'subadditivity property'). By exploiting these 
properties one can derive information-theoretic analogues of Theorems 3.1 and 3.2 which we will 
now briefly describe. For convenience we will deal just with the degenerate site distribution in both 
cases. In the setting of Theorem 3.1 it can be shown that 

I(aa;a,)<log2|C| j; £-'?*(-). 
vedT 
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Similarly, in the setting of Theorem 3.2 it can be shown that 



I{T{s);{al%,\T%s))<k 



v£dT 



e 



For further details, and applications of these results, see [40]. 

The results described in this section may give the impression that phylogenetic information 
decays in a smooth fashion according to an interplay of time, substitution rate, and numbers of 
leaves in the tree. However as we explain in the next section there are underlying transitions in 
this behavior. 

4 Phase transitions in ancestral state and tree reconstruction 

There is an interesting change ('phase transition') in the behavior of Markov models of character 
evolution on trees as the probability of substitution on edges of the trees passes a certain critical 
value. This has been well studied in statistical physics and in information theory, in the context 
of broadcasting on trees. But it is also relevant to biology - particularly in attempting to recover 
information (ancestral states, branching order) deep within a tree, from observing the character 
states at the leaves. 

The transition is most easily explained, and has been most studied for the case of the 2~state 
symmetric process (the CFN model described above). 

To illustrate this transition between what is called the 'ordered' and 'unordered' phases of a 
Markov process on a tree, suppose we have a rooted binary phylogenetic tree T that has n = 2™" 
leaves that are at distance m from the root vertex, as indicated in Fig. 2. 




Figure 2: A character of the CFN process on a binary phylogenetic tree on 8 = 2^ leaves at distance 
3 from the root 



Under the CFN model (and with a degenerate site specific rate distribution) let 

^(e) := det(M(e)) = det(exp(t(e)Q)), 

where, for any square matrix M, det(M) denotes the determinant of M (the product of the eigen- 
values of M). A classic identity in linear algebra, Jacobi's identity, states detexp(M) = exp(tr(M)) 
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where tr(M) is the trace of M (the sum of the diagonal entries of M, which also equals the sum of 
the eigenvalues of M). Thus, 

e(e)=exp(t(e)tr(Q)) = e-2*(e). 

By (1) we have 6(e) = 1 — 2p{e). Now suppose that each edge of T has the same i(e) value, say t, 
and thereby the same 6(e) value, namely 6 = exp(— 2t). 

Let us further suppose that the distribution vr of states at the root is uniform (i.e. a fair coin 
toss) and that we wish to use the states ag = (ag) at the leaves of T to estimate the state Up at 
the root. This gives rise to an interesting contest as m (the height of the tree) increases - firstly, 
each leaf is becoming increasingly far from the root, and so the information that it carries about 
the ancestral root state decays to with increasing m. On the other hand, the number of leaves 
grows (exponentially) with m, and so although each leaf carries less information, it might be hoped 
that together they compensate for their individual losses. Which factor wins out depends critically 
on the value of 6. Evans et al. [15] established that, for 29^ < 1 the mutual information I(aQ,ap) 
converges to 0, as m tends to infinity (this result was first proven independently by [5] in a different 
formulation). Thus, eventually (as the root becomes increasingly 'deep' in the tree) it becomes 
impossible to estimate the root state with any better success than a blind guess, when 6 lies in 
this region. On the other hand, when 26'^ > 1 then I(aQ,ap) is bounded away from 0, so that 
information about the root 'survives' to the leaves, no matter how large the tree grows. In this 
case maximum likelihood estimation or majority rule estimation (i.e. select the root state that 
corresponds to the majority state at the leaves) suffices to recover some information right up to 
(but not including [44]) the critical value 26'^ = 1. 

Notice that this critical value translates to a common t(e) value of t = ^ log(2) and thereby to 
a common p(e) value of p = 2(1 75)- 

Curiously, the maximum parsimony approach for ancestral state reconstruction (i.e. select the 
root state that requires the fewest transitions to account for the leaf states) recovers information 
under the CFN model for values of p only up to g [7] . 

The situation for r-states models and for non-symmetric 2-state processes is more subtle. There 
is not any general criteria for deciding when the mutual information I(ag; Up) is converging to and 
when is it bounded away from 0. In fact, such criteria do not even exist for symmetric processes 
on more than 2 states or for general processes on 2 states. In the general setting, there are various 
conditions which imply that the mutual information either converges to or is bounded away from 
0. However, these conditions are not sharp. We describe an example of both types of conditions 
now. 

Suppose that M(e) = M for all e. Since M is a stochastic matrix, 1 is an eigenvalue of M. 
Let {1 = Al, . . . , Ar} denote the set of eigenvalues of M and let 6 = max{|A2|, . . . , |Ar|} (note that 
for the CFN model, this is consistent with the previous definition oi 6). A "spectral criterion" 
([29, 42]) implies that for any M if 26"^ > 1 then /(cg; ap) is bounded away from zero for all trees. 
This result is not tight in general (see [38, 42, 28]). 

In order to illustrate the spectral criterion consider the Jukes-Cantor model (2). Note that the 
eigenvalues of Q are (with multiplicity 1) and — 4r (with multiplicity 3). Thus if M = exp(tQ), 
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then the eigenvalues of M are 1 and e ^''*. Therefore, if the stochastic matrix M = ex.p{tQ) satisfies 

A := 2e-S'^* > 1 (14) 

then by the spectral criterion I{aQ;ap) is bounded away from zero for all trees. This should be 
compared to Theorem 3.2 which implies that if 

B := 2e-^''* < 1 (15) 

and the tree is of depth at least d, then the probability of reconstructing the ancestral states is 
bounded by 1/4 + (2e~^''*)'^. The expression 1/4 + (2e~^^*)'^ converges to 1/4 when d -^ cxd. Thus, 
condition (14) implies that then I(aQ;ap) is bounded away from zero for all trees, while condition 
(15) implies that I{ag;ap) converges to as d ^ cx). 

In the other direction, various conditions are derived in [38, 42, 33, 32] that imply that /(up; ag) 
converges to for various processes. The simplest of these conditions is given in [38] - this condition 
is closely related to the one given in Theorem 3.2. The results in [42, 33, 32] give sharper bounds 
for symmetric processes on more than 2 states and for general 2-state processes. 

Let us illustrate how Proposition 4.2 of [42] translates to the "Jukes-Cantor" setting. This 
proposition specialized for the Jukes-Cantor model asserts that if 

C ■■= ^-^=^ < 1 (16) 

2 + 2 

then /(o"a; cjp) converges to as ci ^ oo. Simple algebra shows that A < C < B. Thus (16) gives a 
weaker condition than 15) (and therefore a stronger result) implying that I{aQ;ap) converges to 0. 

4.1 The logarithmic conjecture 

Suppose we generate k characters independently and according to the CFN model (with degenerate 
site specific rate distribution), and ask how large k should be in order that, with probability at least 
1 — e we can correctly recover from these characters the topology of the underlying phylogenetic 
tree. Let A;min(e) be the smallest value of k that achieves this last property. Clearly A;min(e) depends 
on features of the generating tree, in particular the number n of leaves, and the assignment of t(e) 
values to the edges of this tree (it also depends on e, however we will regard this as a fixed small 
number). Any dramatic 'shortening' of an interior edge, or 'lengthening' of an exterior edge (i.e. 
making the t{e) value small or large, respectively) will cause kraini^) to diverge and so we will 
assume that each binary phylogenetic tree has all its t(e) values in some fixed interval [In, Un] which 
may depend on n. The questions of interest are then to determine the dependence of kmmi^) on 
n and the values {ln,Un)- Essentially this question provides another formalization of the question 
'how much phylogenetic information is contained in characters that evolve according to a simple 
Markov model.' The authors of [14] showed that, 

femin(e) < 4 72 exp(u„5„(r)) (17) 

where c[ is a constant (dependent only on e) and 5n{T) is a function (only) of the phylogenetic 
tree T and that grows slowly with n. Specifically, 6n(T) is at most a constant times log(n), but is 
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typically (i.e. on average) 0(log(log(n)). It is a measure of how many edges of the tree separate 
the 'deepest' vertex from its nearest leaf. 

Thus if we were to regard In and Un as constants (independent of n) then fcmin(e) is at worst 
polynomial in n, and more typically a power of log(n) (improving an alternative bound described 
in [17]). We have not mentioned the tree reconstruction method used to establish (17); it is a 
polynomial time (in n = |X|) algorithm, and chosen more for tractability of analysis than for any 
supposed superior performance; a comparable analysis for maximum likelihood seems more difficult 



An obvious question arises: is the bound on A;min(e) described by (17) and the consequent 
relationship between A;niin(e) and n (for ln,Un fixed) optimal? Certainly A;min(e) must grow at least 
as fast as (a constant times) log(n), by elementary counting arguments. This applies under any 
model of sequence evolution on a bounded state space and any tree reconstruction method [57]. 
The essence of this argument is the following: there are , _2V2"-^ trivalent phylogenetic X-trees 
and r collections consisting of n aligned sequences of length k on an r-letter alphabet, and so 
if /c = o(log(n)) then for sufficiently large n there exist more trivalent phylogenetic X-trees than 
r-letter sequences of length k. 

Also an inverse square dependence of A;inin(e) on /„ is necessary, even when n = 4, as shown in 
[58]. However there is reason to believe that (17) is not optimal, provided that Un is less than the 
critical transition value (viz. ^ logg(2)) between the ordered and unordered states, discussed above. 
This has led to the following conjecture, which promises a remarkable strengthening of (17) under 
a further restriction. 

Conjecture 4.1. Consider the CFN model for binary characters, and suppose that Un < u < 
ilog,(2). Then 

k ■ (eXc ■ ^°^^"^ 
where c^^u is a constant that depends only on e and u. 

Conjecture 4.1 is clearly true for trees for which Sn(T) is bounded - these are trees for which 
no vertex is very far from a leaf (for example, the class of 'caterpillar trees' which are the trivalent 
phylogenetic trees tfor which every interior vertex is adjacent to a leaf). However for trees that have 
'deep' vertices such as the complete balanced binary phylogenetic tree that has all its n = 2"^ leaves 
at distance m from a fixed central edge, bound (17) is polynomial in n. Yet precisely in this 'worst 
case' setting the bound promised by Conjecture 4.1 holds - this was recently established in [37], 
using an entirely different approach from [14]. The paper [37] also showed that the restriction on 
Un is necessary for Conjecture 4.1 to hold, for when n„ is allowed to take larger values, polynomial 
dependence of /cmin(e) on n can result. 

Conjecture 4.1 has been extended to a much more general conjecture in [37] concerning the 
transition from logarithmic to polynomial dependence of A;inin(e) on n for a range of Markov models 
at the corresponding transition from the ordered to unordered phase of the process. 
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4.2 Reconstructing forests 

Given that it may be difficult to reconstruct deep parts of a tree (for example, in the region where 
Conjectureconj does not apply) a more modest task may be to to reconstruct most of the tree that 
is not 'deep'. A natural question then is whether this can be achieved using a small (logarithmic 
in n) number of sites). 

Note that for the rooted binary tree on 2™ leaves, where all the leaves are at distance m from 
the root, only an 0(2~*) fraction of the vertices is at distance s or more from the set of leaves. 
This is true in general for binary trees - only a 0(2"'^) of the vertices are at distance s or more 
from the set of leaves. In other words, for all binary trees the "deep" part consists of exponentially 
small fraction of the tree. Therefore reconstructing the part which is not "deep" still contains a lot 
of information on (recent) divergences. 

It turns out that the answer to the above question is positive. In [41] it is shown that a 
logarithmic number of characters suffices to reconstruct a forest containing most edges of the tree. 
Moreover, [41] gives a formula relating the "depth" of the forest that can be recovered from a given 
number of characters. 



5 Processes on an unbounded state space: The random cluster 
model 

For the remainder of this chapter we will investigate the phylogenetic information that is provided 
by models which have a large state space. In this section we deal with a slightly idealized 'random 
cluster' model, in which the underlying state space might be regarded as being infinite - it has the 
property that any substitution always gives rise to a new state. We will see that this simple model 
is quite tractable and leads to a result (Theorem 5.1) that is much cleaner than anything that has 
been established yet for even the CFN model. We will apply this result in the final section of this 
chapter to investigate a class of models on large but finite state spaces. 

For the type of Markov model on a small state space that we have dealt with so far the subsets 
of the vertices of a phylogenetic tree T that are assigned particular states do not generally form 
connected subtrees of T (in biological terminology this is because of 'homoplasy' - the evolution of 
the same state more than once in the tree, either due to reversals or convergent evolution). 

However increasingly there is interest in genomic characters such as gene order where the un- 
derlying state space may be very large ([20], [35], [36], [48]). For example, the order of k genes in 
a signed circular genome can take any of 2^[k — 1)! values. In these models whenever there is a 
change of state - for example a re-shuffling of genes by a random inversion (of a consecutive subse- 
quence of genes) - it is likely that the resulting state (gene arrangement) is a unique evolutionary 
event, arising for the first time in the evolution of the genes under study. Indeed Markov models 
for genome rearrangement such as the (generalized) Nadeau- Taylor model [35], [45] confer a high 
probability that any given character generated is homoplasy-free on the underlying tree, provided 
the number of genes is sufflciently large relative to n = |X| ([50]). Here the phrase 'homoplasy-free' 
refers to the condition that a character has parsimony score equal to the number of states it takes 
at the leaves minus 1; this condition has a natural interpretation in biology, since it is equivalent to 
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requiring that the character could have evolved on the tree without reversals or convergent evolu- 
tion (for details of that connection see [50, 51]). In this setting a 'random cluster' model which we 
will describe here is the appropriate (limiting case) model, and may be viewed as the phylogenetic 
analogue of what is known in population genetics as the 'infinite alleles model' of Kimura and Crow 

[30]. 

Thus for this section we consider the size of the state space to be infinite (or at least very large, 
and perhaps variable with n). Some of the arguments described above are no longer valid in this 
setting. For example, the simple argument in Section 4.1 that showed that /cmin(e) rnust grow at 
least as fast as the function log(n) does not apply when the size of state space is infinite, or finite 
but variable with n = \X\. Indeed it has recently been shown that for any trivalent phylogenetic 
X-tree T there is an associated set of just four characters for which T is the only phylogenetic 
X-tree on which each character in that collection has a homoplasy-free evolution (see [50], [27]). 
Thus it is reasonable to ask whether 0(1) characters might suffice to reconstruct T under a simple 
random model. We will see that the answer to this question is 'no', but clearly we need a different 
type of argument. 

Consider the following random process on a phylogenetic tree T. For each edge e let us indepen- 
dently either cut this edge - with probability p{e) - or leave it intact. The resulting disconnected 
graph (forest) G partitions the vertex set V{T) of T into non-empty sets according to the equiva- 
lence relation that n ~ f if f and v are in the same component of G. This model thus generates 
random partitions of V{T), and thereby of X by connectivity, and we will denote these partitions 
of V{T) and X using the symbols x and X) respectively. Fig. 5(b) illustrates this process. 





(a). (b). 



Figure 3: (a) A trivalent phylogenetic X-tree T for X = {1,2,... ,7}; (b) For the random cluster 
model, cutting the edges of T that are marked by a cross induces the character x on X given by 
X = {{1,3}, {2, 4,5}, {6}, {7}}. 



For an element x G X we will let xi^) denote the equivalence class containing x. We call the 
resulting probability distribution on partitions of X the random cluster model with parameters 
(r,p) where p is the map e i-^ J'(e). 

In keeping with the biological setting we will call an arbitrary partition x of X a character (on 
X). Let P[x|r, p] denote the probability of generating a character x under the random cluster 
model with parameters {T,p). We say a subset C of the set E{T) of edges of T is a cutset for x on 
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T if the partition x of X equals that induced by the components of {V{T), E{T) — C). Then 

nx\T,p] = Y.llpie) H (l-p(e)), (18) 

C eeC eeE(T)-C 

where the summation is over ah cutsets C for x on ^- Note that the number of terms in the 
summation described by (18) can be exponential with n = \X\. However by modifying the well- 
known dynamic programming approach for computing the probability of a character on a tree 
according to a finite state Markov process (see eg. [18]) one can compute P[x|T, p] in polynomial 
time in n = \X\. 

Note that the probability distribution described by (18) models the evolution of characters under 
the assumptions that any substitution is always to a new state, and with indepedence between 
substitution events on different edges of the tree. We will relate this model to more explicit models 
of character evolution (on a finite but large state space) in the next section. 

Suppose we generate a sequence 11 = (xi, . . . , Xk) of k such independent characters on X where 
the generating pair (T,p) is unknown. We wish to reconstruct T with probability at least 1 — e 
from n. 

The following theorem, from [43] describes how the required value of k is related to the size 
of T and properties of p, and illustrates the logarithm-polynomial phase transition that occurs 
depending on whether or not the p{e) values are all less than 1/2. We refer the reader to [43] for 
the proof of this result. 

Theorem 5.1. Let < I < u < 1 and < e < 1 be fixed constants. Consider the random 
cluster model on any collection of the parameters {T,p) where T is a trivalent phylogenetic tree, 
and I < p{e) < u for all edges e ofT. Let k be the number of characters generated i.i.d. under 
this model, and /cmin(e) ^e the minimal k such that the tree can be correctly reconstructed from the 
characters with probability at least 1 — e. Then, if n denotes the number of leaves of T. 

(i) fcmin(e) grows logarithmically with nifu<^. In particular, if 

then the tree can be reconstructed correctly with probability 1 — e. Furthermore, there is a 
polynomial-time (in n) algorithm for reconstructing T from the generated characters. 

(a) fcmin(e) can grow polynomially with nifly^. In particular, for all h, if 

then there exists a distribution on trivalent phylogenetic X -trees, such that if T is drawn 
according to the distribution, p{e) = I, for all edges of the trees, and characters are generated 
by {T,p), then the probability of correctly reconstructing T given the k characters is bounded 
above by e + 3~^^^ . 



19 



Theorem 5.1 shows that the situation with the random cluster model differs from a bounded- 
state spaces model such as the CFN model, in two respects. Firstly, the critical value is ^ instead 
of ^(1 — \=). This corresponds to the fact that in statistical physics models on the binary tree, the 

critical value for the extremality of the free measure or the Ising model is ^(1 — y=), see [5, 15, 39], 

while the critical value for uniqueness of Gibbs measure, or the critical value for percolation is ^, 
see [24, 47]. In [40] it is shown that for any Markov model, if the substitution rate is high then k 
depends polynomially on n. 

The second respect in which the random cluster model differs from the symmetric two state 
model, is that for the random cluster model, the dependence of /c on / has exponent —1 rather than 
-2, see [14, 37, 58]. 

We now provide a brief outline of the proof of part (i) of Theorem 5.1, since it combines a 
combinatorial result with a probabilistic percolation argument; full details can be found in [43]. 
Recall that a quartet tree is a trivalent phylogenetic X-tree for |X| = 4. We can represent any 
quartet tree by the notation xy\wz where x,y are leaves that are adjacent to one interior vertex, 
while w, z are leaves that are adjacent to the other interior vertex. For any trivalent phylogenetic 
X-tree, T let Q(r) denote the set of quartet trees induced by T by selecting subsets of X of size 
4. It is a fundamental result from [9] that T is uniquely determined by Q(T'). Suppose that T is a 
trivalent phylogenetic X-tree. We say that T displays a quartet tree xy\wz (respectively, a set Q 
of quartet trees) if xy\wz G QiT) (respectively, if Q C Q(T)). For example the tree T in Fig. 5(a) 
displays the quartet tree 12|47. For any three distinct vertices a,b,c of T let med(a, 6, c) denote 
the median vertex of the triple a, b, c; that is, the unique vertex of T that is shared by the paths 
connecting a and b, a and c and b and c. 

A collection Q of quartet trees is a generous cover of T if Q C Q{T) and if, for all pairs 
of interior vertices u,v there exists a quartet tree xx'\yy' £ Q for which u = med(x,a;',u) and 
V = med(M, y, y'). Given a sequence C = (xi, X2, • • • , Xk) of characters on X, let 

Q{C) = {xx\yy' : 3i G {1, . . . , /c} : Xi{x) = Xi{x) / Xi{v) = Xiiv')]- 

One of the main steps in the proof of Theorem 5.1 is to establish the following purely combina- 
torial result: 

Proposition 5.2. If Q is a generous cover of a trivalent phylogenetic X-tree T then T is the only 
phylogenetic X-tree that displays Q. 

Using a percolation-style argument, one can then show that provided k is at least as large as 
that specified in Theorem 5.1(i), an i.i.d sequence C oi k characters will, with probability at least 
1 — e have the property that Q{C) is a generous cover of T. 

A further extension of Proposition 5.2, along with a simulation-based study of the random 
cluster model has been described recently by [11]. 

6 Large but finite state spaces 

Finally, we turn to the question of how many characters one needs to reconstruct a large tree if 
the characters evolve under a Markov model on a large but finite state space. The results of this 
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section are based on [55] where further details can be found. 

As mentioned earher, many processes involving simple reversible models of change can be mod- 
elled by a random walk on a regular graph. To explain this connection, suppose there are certain 
'elementary moves' that can transform each state into some 'neighboring' states. In this way we can 
construct a graph from the state space, by placing an edge between state a and state (3 precisely if 
it is possible to go from either state to the other in one elementary move. The graph so obtained 
is said to be regular, or more specifically d-regular if each state is adjacent to the same number d 
of neighboring states. 

For example, aligned sequences of length N under the r-state Poisson model can be regarded 
as a random walk on the set of all sequences of length N over R; here an elementary move involves 
changing the state at any one position to some other state (chosen uniformly at random from the 
remaining r — 1 states). Thus the associated graph has r vertices and it is (r — l)A^-regular. 

As another example, consider a simple model of (unsigned) genome rearrangement where the 
state space consists of all permutations of length N (corresponding to the order of genes 1, . . . , A'^) 
and an elementary move consists of an inversion of the order of the elements of the permutation 
between positions i and j, where this pair is chosen uniformly at random from all such pairs between 
{!,... , N}. In this case the state space has size A! and the graph is d-regular for d = ( 2 ) • 

Both of the graphs we have just described have more structure than mere d-regularity. To 
describe this we recall the concept of a Cayley graph. Suppose we have a (non-abelian or abelian) 
group Q together with a subset S of elements, with the properties that Ig ^ S and s ^ S ^ s~^ G S. 
Then the Cayley graph associated with the pair {G,S) has vertex set Q and an edge connecting g 
and g' whenever there exists some element s £ S for which g = g' ■ s. To recover the above graph 
on aligned sequences of length A over an r-letter alphabet, we may take Q as the (abelian) group 
{'Lr)^ (the product of A^ copies of the cyclic group of order r) and the set S of all A-tuples that 
are the identity element of Z^ except on any one co-ordinate. To recover the graph described above 
for unsigned genome rearrangements we may take Q to be the (nonabelian) group consisting of all 
A! permutations on A letters and S to be the elements corresponding to inversions. 

The demonstration that such graphs are Cayley graphs has an important consequence - it implies 
that they also have the following property. A graph G is said to be vertex-transitive if, for any two 
vertices u and v there is an automorphism of G that maps u to v. Informally, a graph is vertex- 
transitive if it 'looks the same, regardless of which vertex one is standing at'. Clearly a (finite) 
vertex-transitive graph must be d-regular for some d, and it is an easy and standard exercise to 
show that every Cayley graph is vertex-transitive (however not every vertex-transitive graph is a 
Cayley graph, and not every regular graph is vertex-transitive). 

Suppose that i? is a group, and for some subset S (closed under inverses and not containing the 
identity element of R) we have a rate matrix Q (as discussed earlier) for which Qap = <? if and only 
if there exists some element s G 5 for which [3 = a ■ s, otherwise for any distinct pair a, (5 we have 
Qap = 0. Such a process we will call a group walk process (on the generating set S). Group walk 
processes have a useful property on trees: for each arc e = {u, v) of T = {V, E) consider the event 
A(e) that the state that occurs at v is different from the state that occurs at u (i.e. there has been 
a net transition across the edge). Then using the fact that the Cayley graph for (R, S) is vertex 
transitive, it is easily shown that the events (A(e),e G E) are independent. 
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For these models we then ask how hkely it is that a character evolves without homoplasy on 
a tree. This question has been investigated for the 2-state Poisson model (and pairs of taxa) by 
Chang and Kim [6]. Here we consider more general processes on a larger state space, and for many 
taxa - consequently we obtain bounds rather than the exact expressions that are possible in the 
simpler setting of [6]. The proof of the following lemma is straightforward (for details, see [55]). 

Lemma 6.1. Let {Xt;t > 0) be a group walk process on generating set of size d. Then, for any 
two distinct states a, (5, and any values s, t > 0, 

nXt+s = fi\Xt = a]<^. 

The following result shows that for a group- walk process if the size of the generating set is much 
larger than 2n^ (where n is the number of species) then any character generated on a tree with n 
species will almost certainly be homoplasy-free on that tree. 

Proposition 6.2. Suppose characters evolve on a phylogenetic tree T according to a group walk 
process on a generating set of size d. Let p{T) denote the probability that the resulting randomly- 
generated character x is homoplasy-free on T. Then 

rtr)>i-fi-"-^>(»-i' 



d 
where n = \X\. 

Proof. Consider a general Markov process on T with state space R. Suppose that for each arc 
(w, v) of T and each pair a, (3 of distinct states in i?, the conditional probability that state (3 
occurs at v given that a occurs at u is at most p. Then, from [50] (Proposition 7.1) we have 
p{T) > 1 — (2n — 3)(n — l)p. By Lemma 6.1 we may take P = 2- '^^^ result now follows. D 

We are now ready to state a result for certain Markov processes on large (but finite!) state 
spaces, which brings together several ideas presented above. Informally, Theorem 6.3 states that 
for a group walk process, a growth of around n^ log(n) in the size of the generating set is sufficient 
(with all else held constant) for producing a sequence of homoplasy-free characters that define T. 

Let Pmm = min{P[A(e)] : e G -E}, Pmax = max{P[A(e)] : e G E}, and for any e > let 

c. = ^^ (20) 



'-'max 



where /3 = PmmvT:r^ 

Theorem 6.3. Suppose characters evolve i.i.d. on a binary phylogenetic tree T according to a 
group walk process on a generating set of size d, where 

d > Ce ■ n log(n) 

with Ce given by (20) and with Pmax < 2- Then with probability at least 1 — 2e we can correctly 
reconstruct the topology of T by generating an appropriate (and logarithmic in n) number of char- 
acters. 
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Proof. Let us generate k = [■glog(-^)] characters under a group walk process on a rooted phy- 
logenetic tree. Consider the event H that all of these characters are homoplasy-free on T. By 
Proposition 6.2 we have 

F[H] > (1 - (''^-^)}^-^)^k > (1 _ 2^ |iog(^)_ 
a Cf log(n) 

Now, applying the inequality (1 — x)^ > 1 — xy for x,y > 0, and straightforward algebra gives 

e(l + log(^))-i 1 

P[^] > 1 ^-r^ [log(n) + log(^)] > 1 - e, 

log(n) Ve 

using log(n) > 1. 

To relate the group walk process to the random cluster model we use a simple coupling argument. 
First consider any linear extension of the partial order induced by T on its vertices - i.e. impose a 
time-scale on the tree. To each vertex v assign the pair (a, i) where a is the state of the group walk 
process at v, and i G {0, 1, 2, ...} indicates how often this state has arisen from another state earlier 
in the tree. Note that this model always generates new states (it is homoplasy-free). Furthermore, 
let A{e) denote the event that the two vertices in e have different states under the coupled process. 
Notice that A(e) occurs precisely when the two vertices in e have different states under the original 
process. Thus, the collection {^(e) : e G E} is a collection of independent events. Consequently this 
coupled process is precisely the random cluster model, and we may identify (3 with the expression 
l{ {'_u )'^ ^^ Theorem 5.1(i). Furthermore the probability that T will be correctly reconstructed 
from k characters produced by the coupled process is at least 1 — e by Theorem 5.1 (recalling that 

Pmax ^ 2/' 

On the other hand the original k characters induce the same partitions as the derived characters 
whenever event H holds, and F[H] > 1 — e. Consequently, by the Bonferroni inequality, the joint 
probability that event H holds and that the k characters produced by the coupled process recovers 
T is at least 1 — 2e. Thus the probability that the original k characters recover T is at least this 
joint probability, and so at least 1 — 2e, as claimed. D 

We end this section with a remark of caution. Theorem 6.3 suggests a possible (but flawed) 
way to avoid the well-known statistical inconsistency of phylogenetic reconstruction methods such 
as maximum parsimony or maximum compatibility under simple models of site substitution. This 
approach is based on grouping sites into s-tuples. For example if the underlying process is the 
symmetric 2-state model, then if we take s-tuples of sites we obtain a group walk process on a 
generating set of size 2* — 1. So by taking s large enough Theorem 6.3 suggests that methods like 
compatibility (or maximum parsimony) should be able to correctly reconstruct a tree with high 
probability. The flaw in this argument is that as s increases, so too does Pmax and once Pmax exceeds 
2 Theorem 6.3 is no longer valid. In fact one can show that for the 2-state symmetric model and 
a tree with 4 leaves, the branch length requirements for the statistical consistency of maximum 
parsimony on site data is identical to that on s-tuple site data (Theorem 3A of [56]). 
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7 Concluding comments 

Stochastic models of character evolution have come to play an central role in phylogenetic genomics. 
In this chapter we have investigated various aspects of the broad question: how much information do 
characters tell us about their evolutionary history? There are two ways to investigate this issue - one 
approach that is popular in biology and bioinformatics is to use simulation (for example, generating 
sequences on a tree and then comparing the tree that is reconstructed from these sequences with 
the original tree) and several authors have taken this approach (eg. [8]). Here we have adopted an 
analytical approach, that gives provable and general bounds on quantities of interest. This approach 
has both advantages and disadvantages over a simulation-based one. Its main disadvantage is that 
the bounds we obtain may sometimes be far from being 'exact', and in that case simulations 
provide a useful way to detect and quantify this (as was undertaken for the random cluster model 
in [11]). However analytic approaches have some unique advantages: the results are generic and 
not specific to particular parameter settings in the simulation runs; indeed some of the bounds we 
presented apply to any tree reconstruction method, something that would be difficult or impossible 
to investigate using simulations. Furthermore, analytic results can show precisely the functional 
dependency between quantities of interest (for example the dependence of k on log(n) in the random 
cluster model). Analytic approaches have one more virtue - the arguments used provide insight 
into how information is preserved and lost, and what methods might be most effective in recovering 
it. 

Turning to possible future work, we expect that the conjecture(s) outlined in section 4.1 will 
likely be resolved in the near future. Another challenging mathematical task would be to provide 
explicit lower bounds on the tree reconstruction probability for maximum likelihood - the only 
known bounds (from a more general result in [58] that is not specific to phylogeny) involve expo- 
nential dependence of /c on n and can likely be improved considerably in the phylogenetic setting. 
For applications, it would also be useful to be able to estimate reconstruction probabilities (or 
the sequence length required to achieve a given reconstruction probability) from data, perhaps by 
exploiting bootstrap statistics. Although there has been some empirical approaches to this problem 
(eg. [31]) little has been rigorously established. 
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